Premium paid by the customer is the major revenue source for insurance companies. Default in premium payments results in significant revenue losses and hence insurance companies would like to know upfront which type of customers would default premium payments.
The objective of this project
Problem Statement
“Explore the data to identify customers with the propensity to default on the premiums to be paid to the Insurance company. Identify the factors/ characteristics influencing these default behavior to predict the probability of who will default, so as to help the Insurance Agents to proactively reach out these customers with the right strategy/approach to make sure they pay their due premiums”
Need of the present study and business/social opportunity
This Project submission Part 1 will be about:
1. Description of the Data 2. Initial Exploratory Data Analysis
Data Report Understand how data was collected in terms of time, frequency, methodology
P.S. If possible, we should approach the insurance company (client) to share some more details with respect to the identified information missing in the dataset to help us analyze further. Some of those details could include Gender, Type of Insurance purchased, premium frequency,
The following steps have been undertaken when exploring the dataset:
Environment Set up and Data Import
Variable Identification
Univariate Analysis
Bi-Variate Analysis
Variable Transformation / Feature Creation
Conclusions
knitr::opts_chunk$set(error = FALSE, # suppress errors
message = FALSE, # suppress messages
warning = FALSE, # suppress warnings
echo = FALSE, # suppress code
cache = TRUE) # enable caching
The dataset contains the following information about 79854 policy holders:
Observation:
VIEW PLOT: The plot structure reveals the various variables of the data frame
## [1] 79853 17
DIMENSIONS: shows Columns = 17 and Rows = 79, 853
HEAD & TAIL: A uniform similarity is observed in the data frame’s columns and rows using the “head’ &”tail" function.
## tibble [79,853 × 17] (S3: tbl_df/tbl/data.frame)
## $ id : num [1:79853] 1 2 3 4 5 6 7 8 9 10 ...
## $ perc_premium_paid_by_cash_credit: num [1:79853] 0.317 0 0.015 0 0.888 0.512 0 0.994 0.019 0.018 ...
## $ age_in_days : num [1:79853] 11330 30309 16069 23733 19360 ...
## $ Income : num [1:79853] 90050 156080 145020 187560 103050 ...
## $ Count_3-6_months_late : num [1:79853] 0 0 1 0 7 0 0 0 0 0 ...
## $ Count_6-12_months_late : num [1:79853] 0 0 0 0 3 0 0 0 0 0 ...
## $ Count_more_than_12_months_late : num [1:79853] 0 0 0 0 4 0 0 0 0 0 ...
## $ Marital Status : num [1:79853] 0 1 0 1 0 0 0 0 1 1 ...
## $ Veh_Owned : num [1:79853] 3 3 1 1 2 1 3 3 2 3 ...
## $ No_of_dep : num [1:79853] 3 1 1 1 1 4 4 2 4 3 ...
## $ Accomodation : num [1:79853] 1 1 1 0 0 0 1 0 1 1 ...
## $ risk_score : num [1:79853] 98.8 99.1 99.2 99.4 98.8 ...
## $ no_of_premiums_paid : num [1:79853] 8 3 14 13 15 4 8 4 8 8 ...
## $ sourcing_channel : chr [1:79853] "A" "A" "C" "A" ...
## $ residence_area_type : chr [1:79853] "Rural" "Urban" "Urban" "Urban" ...
## $ premium : num [1:79853] 5400 11700 18000 13800 7500 3300 20100 3300 5400 9600 ...
## $ default : num [1:79853] 1 1 1 1 0 1 1 1 1 1 ...
STRUCTURE OF DATASET: There are some variables in the data set which are numerical in nature whose format needs to be changed for proper analysis.
PLOT DIMENSIONS:
– The plot intro shows 12% of the columns are discrete in nature while 88% are continuous. This will change as the formats of some variables will be changed for analysis. – There are no missing columns or rows and no missing observations, which indicates the data is uniform and complete with no undesired discrepancies.
## id perc_premium_paid_by_cash_credit age_in_days
## Min. : 1 Min. :0.0000 Min. : 7670
## 1st Qu.:19964 1st Qu.:0.0340 1st Qu.:14974
## Median :39927 Median :0.1670 Median :18625
## Mean :39927 Mean :0.3143 Mean :18847
## 3rd Qu.:59890 3rd Qu.:0.5380 3rd Qu.:22636
## Max. :79853 Max. :1.0000 Max. :37602
## Income Count_3-6_months_late Count_6-12_months_late
## Min. : 24030 Min. : 0.0000 Min. : 0.00000
## 1st Qu.: 108010 1st Qu.: 0.0000 1st Qu.: 0.00000
## Median : 166560 Median : 0.0000 Median : 0.00000
## Mean : 208847 Mean : 0.2484 Mean : 0.07809
## 3rd Qu.: 252090 3rd Qu.: 0.0000 3rd Qu.: 0.00000
## Max. :90262600 Max. :13.0000 Max. :17.00000
## Count_more_than_12_months_late Marital Status Veh_Owned
## Min. : 0.00000 Min. :0.0000 Min. :1.000
## 1st Qu.: 0.00000 1st Qu.:0.0000 1st Qu.:1.000
## Median : 0.00000 Median :0.0000 Median :2.000
## Mean : 0.05994 Mean :0.4987 Mean :1.998
## 3rd Qu.: 0.00000 3rd Qu.:1.0000 3rd Qu.:3.000
## Max. :11.00000 Max. :1.0000 Max. :3.000
## No_of_dep Accomodation risk_score no_of_premiums_paid
## Min. :1.000 Min. :0.0000 Min. :91.90 Min. : 2.00
## 1st Qu.:2.000 1st Qu.:0.0000 1st Qu.:98.83 1st Qu.: 7.00
## Median :3.000 Median :1.0000 Median :99.18 Median :10.00
## Mean :2.503 Mean :0.5013 Mean :99.07 Mean :10.86
## 3rd Qu.:3.000 3rd Qu.:1.0000 3rd Qu.:99.52 3rd Qu.:14.00
## Max. :4.000 Max. :1.0000 Max. :99.89 Max. :60.00
## sourcing_channel residence_area_type premium default
## Length:79853 Length:79853 Min. : 1200 Min. :0.0000
## Class :character Class :character 1st Qu.: 5400 1st Qu.:1.0000
## Mode :character Mode :character Median : 7500 Median :1.0000
## Mean :10925 Mean :0.9374
## 3rd Qu.:13800 3rd Qu.:1.0000
## Max. :60000 Max. :1.0000
## [1] "id" "perc_premium_paid_by_cash_credit"
## [3] "age_in_days" "Income"
## [5] "Count_3-6_months_late" "Count_6-12_months_late"
## [7] "Count_more_than_12_months_late" "Marital Status"
## [9] "Veh_Owned" "No_of_dep"
## [11] "Accomodation" "risk_score"
## [13] "no_of_premiums_paid" "sourcing_channel"
## [15] "residence_area_type" "premium"
## [17] "default"
SUMMARY OF THE DATASET:
– The summary shows variables namely Marital Status, Vehicles Owned, Number of Dependents, Accommodations & Default will need to changed to factors for a correct representation of what the data is displaying. – Age is also displayed in ’Days" which need to be changed to ’Years" – After changing the characters of the mentioned variables, we will further explore the summary in detail.
## id perc_premium_paid_by_cash_credit
## 0 0
## age_in_days Income
## 0 0
## Count_3-6_months_late Count_6-12_months_late
## 0 0
## Count_more_than_12_months_late Marital Status
## 0 0
## Veh_Owned No_of_dep
## 0 0
## Accomodation risk_score
## 0 0
## no_of_premiums_paid sourcing_channel
## 0 0
## residence_area_type premium
## 0 0
## default
## 0
MISSING VALUE PLOT: This functions details out that there are no missing values in any of the columns (variables)
VARIABLE TRANSFORMATION: The “Age in Days” column is used to create a new variable column of ’Age" which is represented in years.
FEATURE CREATION: The “age” - in years is used to create a new feature variable i.e. Age Group. The age component is slotted into eight buckets from lowest to highest in ascending order. The eight age groups are as follows
DROPPING THE UNWANTED VARIABLES FROM THE DATA SET:
– “Age in days” is removed and replaced instead by “Age in Years’ –”ID" is removed as it won’t be serving any purpose in the analysis.
reducing the income variable by diving it by 1000 to have an adaptable range for analyzing
CHANGING VARIABLES AS FACTORS: The variables, namely Marital Status, Vehicles Owned, Number of Dependents, Accommodations & Default are appropriately changed to factors to make them discrete observations.
## tibble [79,853 × 17] (S3: tbl_df/tbl/data.frame)
## $ perc_premium_paid_by_cash_credit: num [1:79853] 0.317 0 0.015 0 0.888 0.512 0 0.994 0.019 0.018 ...
## $ Income : num [1:79853] 90 156 145 188 103 ...
## $ Count_3-6_months_late : num [1:79853] 0 0 1 0 7 0 0 0 0 0 ...
## $ Count_6-12_months_late : num [1:79853] 0 0 0 0 3 0 0 0 0 0 ...
## $ Count_more_than_12_months_late : num [1:79853] 0 0 0 0 4 0 0 0 0 0 ...
## $ Marital Status : Factor w/ 2 levels "Married","Not Married": 2 1 2 1 2 2 2 2 1 1 ...
## $ Veh_Owned : Factor w/ 3 levels "1","2","3": 3 3 1 1 2 1 3 3 2 3 ...
## $ No_of_dep : Factor w/ 4 levels "1","2","3","4": 3 1 1 1 1 4 4 2 4 3 ...
## $ Accomodation : Factor w/ 2 levels "Owned","Rented": 1 1 1 2 2 2 1 2 1 1 ...
## $ risk_score : num [1:79853] 98.8 99.1 99.2 99.4 98.8 ...
## $ no_of_premiums_paid : num [1:79853] 8 3 14 13 15 4 8 4 8 8 ...
## $ sourcing_channel : Factor w/ 5 levels "A","B","C","D",..: 1 1 3 1 1 2 3 1 1 1 ...
## $ residence_area_type : Factor w/ 2 levels "Rural","Urban": 1 2 2 2 2 1 1 2 2 1 ...
## $ premium : num [1:79853] 5400 11700 18000 13800 7500 3300 20100 3300 5400 9600 ...
## $ default : Factor w/ 2 levels "Defaulted","Not Defaulted": 2 2 2 2 1 2 2 2 2 2 ...
## $ age : int [1:79853] 31 82 43 64 53 45 44 39 75 81 ...
## $ agegroup : num [1:79853] 2 7 3 5 4 3 3 2 6 7 ...
## perc_premium_paid_by_cash_credit Income Count_3-6_months_late
## Min. :0.0000 Min. : 24.03 Min. : 0.0000
## 1st Qu.:0.0340 1st Qu.: 108.01 1st Qu.: 0.0000
## Median :0.1670 Median : 166.56 Median : 0.0000
## Mean :0.3143 Mean : 208.85 Mean : 0.2484
## 3rd Qu.:0.5380 3rd Qu.: 252.09 3rd Qu.: 0.0000
## Max. :1.0000 Max. :90262.60 Max. :13.0000
## Count_6-12_months_late Count_more_than_12_months_late Marital Status
## Min. : 0.00000 Min. : 0.00000 Married :39821
## 1st Qu.: 0.00000 1st Qu.: 0.00000 Not Married:40032
## Median : 0.00000 Median : 0.00000
## Mean : 0.07809 Mean : 0.05994
## 3rd Qu.: 0.00000 3rd Qu.: 0.00000
## Max. :17.00000 Max. :11.00000
## Veh_Owned No_of_dep Accomodation risk_score no_of_premiums_paid
## 1:26746 1:19840 Owned :40030 Min. :91.90 Min. : 2.00
## 2:26520 2:19902 Rented:39823 1st Qu.:98.83 1st Qu.: 7.00
## 3:26587 3:20215 Median :99.18 Median :10.00
## 4:19896 Mean :99.07 Mean :10.86
## 3rd Qu.:99.52 3rd Qu.:14.00
## Max. :99.89 Max. :60.00
## sourcing_channel residence_area_type premium default
## A:43134 Rural:31670 Min. : 1200 Defaulted : 4998
## B:16512 Urban:48183 1st Qu.: 5400 Not Defaulted:74855
## C:12039 Median : 7500
## D: 7559 Mean :10925
## E: 609 3rd Qu.:13800
## Max. :60000
## age agegroup
## Min. : 20.00 Min. :1.000
## 1st Qu.: 40.00 1st Qu.:3.000
## Median : 50.00 Median :4.000
## Mean : 50.91 Mean :3.643
## 3rd Qu.: 61.00 3rd Qu.:5.000
## Max. :102.00 Max. :8.000
RECHECKING VARIABLES:
The structure of the data now shows that we have variables in numerical and factor categories. Each of them will accordingly be treated and analysed to explore the dataset to understand the characteristic of each ID.
Among the factor variables we observe:
Number of vehicles owned = almost equal number of people own 1, 2 and 3 number of vehicles.
Number of Dependents = again, an almost similar amount of dependents in the 1, 2, 3 & 4 category.
Accommodations owned = not much difference in the numbers who own their own accommodation v/s those who dont.
Sourcing Channel= 54% of the customers are sourced from Source Channel A. while Source Channel E does not amount to even 1%.
Residential Area Type = 60% come from urban area
Premium Defaults = 6.26% of the cohort have defaulted on paying the premiums.
Age = The average age is in the 50s, while the range is wide. There seem to outlier with the maximum age of 102 years.
##
## Defaulted Not Defaulted
## 6.259001 93.740999
DEFAULT VARIABLE: The table split shows that 6.26% of the people have defaulted in their payment of insurance premium
1. Observations on Age: * There seems an almost normal distribution in the Age of the customers. * The range is spread from 21 to 92 with approximately nine outliers between 93 and 102 * The range is concentrated around 41 years to 62 years * Mean & Median, both are around 51 suggesting there aren’t many outliers influencing the mean.
2. Observations on Age Group:
* Age Group 3 sees the maximum amount of individuals, followed by Group 4 * The median falls around Group 4
3. Observation on Income:
* The income range is very widely dispersed. The mean is 208,850 where the median is 166,560 which denotes some very extreme outliers are influencing the mean. * There 3rd quartile is at 252,090 and the maximum stretches to 9,0262,600 which shows the income range is widely dispersed with some making very huge amount compared to the concentration which lie between 108,010 & 252,090
4. Observation on Premium paid in Cash:
5 Observation on late payment of Premium by 3 to 6 months:
6.Observation on late payment of Premium by 6 to 12 months
7. Observation on late payment of Premium by more than 12 months:
8. Observation on Risk score of customers:
9.Observation on Premiums paid by customers:
10. Observation on the number of Premiums paid by the customers
Marital Status: Not much difference in the 2 cohorts, with a slight edge for the unmarried.
Accommodation: Again not much difference in the ones owning their houses and renting them. A slight edge to the ones owning their houses.
Residence Area: There we see 60% of the customers coming from the urban area
Number of vehicles owned: Again equally distributed at around 33% for 1, 2 & 3 vehicles owned by the customers
Number of Dependents: The four cohorts i.e. 1, 2, 3 & 4 have similar numbers in the data. All are between 24% & 25% of the share with a slight edge for 3 dependent at 25.31%.
Sourcing Channels: of the five cohorts i.r. A,B,C,D & E the bulk of the customers at 54%vhave been sourced by Channel A. Substantial amount of customers come from Channel B (20.7%), Channel C (15%) & Channel D (9.5%)
*We can see Sourcing Channels & Residence Area are the only two verticals from where we are able to see a diversion in the customer data.
Age shows a almost normal distribution spread widely between 20 & 90, with the bulk between 20 & 90.
Age Groups follows the Age distribution, with highest concentration in Group 3 followed by Group 4. The median is at Group 4.
Income levels seems dispersed unevenly in the spread.
Premiums paid sees a right skew with a sharp dip in between the rise. The concentrations is between 5400 & 13800.
The number of Premiums Paid seems to have a normal distribution with a positive skew. The concentration is between 7 & 14. Many outliers far & wide up to 60.
Risk Score sees a left skew with the concentration between 98.83 & 99.52. The tail is between 91.90 and 98.
Premium late by 3 to 6 Months, 6 to 12 Months & more than 12 Months:
Looking at the three cohorts above who have been late in paying their premiums on time, its apparent that maximum numbers in all three verticals are the ones who has not been late in paying their premiums on time.
There seems comparatively more people who have delayed paying their premiums 1 or 2 time between 3-6 months compared to the ones who delayed their premiums between 6-12 months & beyond 12 months.
Age: The Defaulters are comparatively a younger cohort with the median at 46. The concentration range being between 37 & 54. The non-defaulters are at a median of 51 with a range between 41 & 62.
Age Group: The defaulters are falling in the concentration range between Group 2 & 4 with a median at 3. The non defaulters are ina a range of Group 3 & 5 with median of 4 which reflects that the concentration of defaulter comparatively fall iin a less age bracket
Income: The income levels of the defaulters show that they come from a low income category compared to the non-defaulters.
Delayed Premium of 3 to 6 Months: + More than 85% of the total non defaulters have never defaulted premium by 3 to 6 months. 10% defaulted once & 2.5% defaulted twice. + More than 53% of the total defaulters have never defaulted premium by 3 to 6 months. 23% defaulted once & more than 11% defaulted twice
Delayed Premium of 6 to 12 Months: + Almost 97% of the total non defaulters have never defaulted premium by 6 to 12 months. 2.5% defaulted once. + 70% of the total defaulters have never defaulted premium by 6 to 12 months. 16.5% defaulted once.
Delayed Premium of more than 12 Months: + 96.5% of the total non defaulters have never defaulted premium by more than 12 months, Approx 3% defaulted once. + More than 76% of the total defaulters have never defaulted premium by more than 12 months. 16.7% defaulted once.
Nos. of Premiums Paid: Both the cohorts i.e. of defaulters and non- defaulters show a similar range in the number of premiums paid. Both cohorts have a median of 10 and almost similar range.
Premium Paid:Both cohorts has the same median of 7500. The range for defaulters is comparatively smaller.
Observation:
*Same kind of data distribution witnessed between defaulters and non-defaulters in the “Marital Status”, “Number of Vehicles owned”, “Number of Dependents” & “Accommodations” & “Residence Area”
Observations:
High Correlation - None of the verticals have a high correlation
Correlation - We can see a relative correlation between “Premium” & “Income” , “Late Payment of Premium by 3 to 6 months” & “Late Payment of Premium over 12 months”
Low Correlation - We see a slight correlation between “Premium paid in cash” & “Late Payment of Premium by 3 to 6 months”, “Premium paid in cash” & “Late Payment of Premium by 6 to 12 months”, “Age” & Sourcing Channel A“,”Age" & “Number of Premium Paid”
Observations:
The Density Plot close high density between :
“Late Payment of Premium by 6 to 12 months” & “Late Payment of Premium by more than 12 months” = 0.27
“Premium Paid in Cash” & “Late Payment of Premium by 3 to 6 months” = 0.21
“Premium Paid in Cash” & “Late Payment of Premium by 6 to 12 months” = 0.21
“Late Payment of Premium by 3 to 6 months” & “Late Payment of Premium by 6 to 12 months” = 0.20
“Premium” & “Nos. of Premiums” = 0.19
“Nos. of Premiums Paid” & “Age” = 0.18
“Premium Paid in Cash” & “Late Payment of Premium by more than 12 months” = 0.17
“Sourcing Channel” & “Premium” = 0.14
“Premium” & “Risk Score” = 0.13
Observations:
— 15% of the defaulters earlier delaying payment on premiums by 3 months to 6 months. — 3% of the defaulters earlier delaying payment on premiums by 6 months to 12 months. — 3.5% of the defaulters earlier delaying payment on premiums by more than 12 months.